Optimization for Feature Selection in DNA Microarrays
Abstract
We present two methods for feature selection in high-throughput transcriptomic data, in which the selected subsets of variables (the genes) are optima of a multi-objective function. In clinical trials, the number of enrolled patients never exceeds the hundreds, while the number of gene expression levels measured for each patient exceeds tens of thousands. These trials aim to better understand the biology of the phenotypes at the genomic level and to better predict the phenotypes, so that each patient can be given the best treatment. Our first method defines the gene subsets as the optima of a bi-objective function, a tradeoff between the size of the gene subset and the discrimination of the phenotypes, expressed as the inter-class distance. Because the gene selection stage is independent of the prediction model, it is a filter method of feature selection. The second method selects gene subsets that optimize the performance of a specific prediction model; it is a wrapper approach to the feature selection problem. The optimal gene subsets are computed by a line search optimization heuristic that maximizes the performance of a linear discriminant analysis. Using public oncology datasets, we compared our results with those of the main previous methods. Our optimization approach to gene subset selection almost always returned subsets that were significantly smaller than those of the previous methods, and our predictors almost always performed better and were more robust. In both methods we searched the space of gene subsets for optima of an explicit multi-objective function. Meta-heuristic methods are well suited to these optimization problems, particularly in high-dimensional spaces.
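The bi-objective filter idea can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes two classes labelled 0 and 1, takes the inter-class distance to be the Euclidean distance between class centroids, and replaces the paper's search with plain random sampling of subsets. The function names (`inter_class_distance`, `pareto_front`, `random_search`) are hypothetical and introduced only for illustration.

```python
import math
import random


def inter_class_distance(X, y, subset):
    """Euclidean distance between the centroids of class 0 and class 1,
    computed only on the genes (column indices) in `subset`.
    Assumption: this stands in for the paper's inter-class distance."""
    def centroid(label):
        rows = [row for row, lab in zip(X, y) if lab == label]
        return [sum(row[g] for row in rows) / len(rows) for g in subset]

    return math.dist(centroid(0), centroid(1))


def pareto_front(candidates):
    """Keep the (size, distance, subset) triples that no other candidate
    dominates. A candidate is dominated when another one is no larger,
    no less separated, and strictly better on at least one objective."""
    front = []
    for size, dist, subset in candidates:
        dominated = any(
            s <= size and d >= dist and (s < size or d > dist)
            for s, d, _ in candidates
        )
        if not dominated:
            front.append((size, dist, subset))
    return sorted(front, key=lambda t: t[0])


def random_search(X, y, n_candidates=200, max_size=10, seed=0):
    """Sample random gene subsets and return the non-dominated ones.
    Random sampling is a placeholder for a real meta-heuristic."""
    rng = random.Random(seed)
    n_genes = len(X[0])
    candidates = []
    for _ in range(n_candidates):
        size = rng.randint(1, min(max_size, n_genes))
        subset = tuple(sorted(rng.sample(range(n_genes), size)))
        candidates.append((size, inter_class_distance(X, y, subset), subset))
    return pareto_front(candidates)
```

Here `X` is a list of per-patient expression rows and `y` the matching list of 0/1 phenotype labels. Each triple returned by `random_search` is one tradeoff point: no sampled subset is both smaller and better separated. Choosing among those points is left to the analyst.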
Similar resources
Diagnosis of the disease using an ant colony gene selection method based on information gain ratio using fuzzy rough sets
With the advancement of metagenome data mining, science has become focused on microarrays. Microarrays are datasets with a large number of genes that are usually irrelevant to the output class; hence, gene selection (feature selection) is essential. Removing redundant genes can increase the speed and accuracy of classification. After applying the gene se...
A Classification Method for E-mail Spam Using a Hybrid Approach for Feature Selection Optimization
Spam is unwanted email that harms communications around the world. Spam is a growing problem in personal email, so detecting it is essential. Machine learning is very useful for solving this problem, as its adaptive nature lets it learn the patterns required for classification with good results. Nonetheless, in spam detection, there are a large num...
Feature Selection in Structural Health Monitoring Big Data Using a Meta-Heuristic Optimization Algorithm
This paper focuses on the processing of structural health monitoring (SHM) big data. Extracted features of a structure are reduced using an optimization algorithm to find a minimal subset of salient features by removing noisy, irrelevant and redundant data. The PSO-Harmony algorithm is introduced for feature selection to enhance the capability of the proposed method for processing the measure...
Fuzzy-rough Information Gain Ratio Approach to Filter-wrapper Feature Selection
Feature selection for various applications has been carried out for many years in many different research areas. However, there is a trade-off between finding feature subsets with minimum length and increasing the classification accuracy. In this paper, a filter-wrapper feature selection approach based on fuzzy-rough gain ratio is proposed to tackle this problem. As a search strategy, a modifie...
A New Approach for Text Documents Classification with Invasive Weed Optimization and Naive Bayes Classifier
With the rapid growth in the number of documents, the use of Text Document Classification (TDC) methods has become crucial. This paper presents a hybrid model of Invasive Weed Optimization (IWO) and a Naive Bayes (NB) classifier (IWO-NB) for Feature Selection (FS), in order to reduce the large size of the feature space in TDC. TDC includes different actions such as text processing, feature extraction, form...
Determining Optimal Support Vector Machines for Hyperspectral Image Classification Based on a Genetic Algorithm
Hyperspectral remote sensing imagery, with its rich spectral information, provides an efficient tool for ground classification in complex geographical areas with similar classes. Given the robustness of Support Vector Machines (SVMs) in high-dimensional spaces, they are an efficient tool for the classification of hyperspectral imagery. However, there are two optimization issues which s...